Zillow's Zestimate was created to give consumers as much information as possible about homes and the housing market. It is based on 7.5 million statistical and machine learning models that analyze hundreds of data points on each property. In this project we develop an algorithm to help push the accuracy of the Zestimate even further.
Team members:
A home is often the largest and most expensive purchase a person makes in his or her lifetime. Ensuring homeowners have a trusted way to monitor this asset is incredibly important. The Zestimate was created to give consumers as much information as possible about homes and the housing market, marking the first time consumers had access to this type of home value information at no cost. For this competition, we will develop an algorithm that makes predictions about the future sale prices of homes.
In [14]:
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import numpy as np
import missingno as msno
import xgboost as xgb
from sklearn.preprocessing import scale
from xgboost.sklearn import XGBRegressor
from sklearn.linear_model import Ridge
from sklearn import cross_validation, metrics #Additional sklearn functions
from sklearn.model_selection import GridSearchCV #Performing grid search
from sklearn.ensemble import RandomForestRegressor
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
from sklearn.svm import SVR
import itertools
from sklearn.model_selection import KFold, cross_val_score
plt.style.use('ggplot')
%matplotlib inline
The dataset we use is provided by the Zillow Prize competition. The response variable is logerror, which is defined as $logerror = \log(Zestimate)-\log(SalePrice)$. There are 58 predictor variables available to choose from; some of them are qualitative, and many variables have a large share of missing values. Here we do some exploratory work to prepare for modelling.
This part will cover the following topics.
In [3]:
train_df = pd.read_csv("originalinput/train_2016.csv", parse_dates=["transactiondate"])
properties = pd.read_csv('originalinput/properties_2016.csv')
In [8]:
train_df.tail()
Out[8]:
In [6]:
train_df.shape, properties.shape
Out[6]:
The train dataset includes 90811 housing transaction records from 2016-01-01 to 2016-12-30, while the test dataset covers 2016-10 to 2017-12. The properties table includes 298517 properties in three places:.... So we need to merge the properties table into train_df to obtain the corresponding predictor information.
In [7]:
train_df = pd.merge(train_df, properties, on='parcelid', how='left')
train_df.head()
Out[7]:
In [4]:
train_df['month']=[a.month for a in train_df['transactiondate']]
monthcount = train_df.groupby(train_df['month']).count()['logerror']
month = np.arange('2016-01', '2017-01', dtype='datetime64[M]')
fig, ax = plt.subplots(figsize=(5,3))
ax.bar(range(1,13),monthcount,color='red')
ax.vlines(10,0,10000)
ax.set_xlim(1,13)
ax.set_xticks(range(1,13))
ax.set_title("Distribution of transaction dates")
Out[4]:
We draw a bar plot of the monthly transaction frequency in our dataset. There are clearly far fewer training records from Oct 2016 to Dec 2016, because part of that period was split off as test data. The first two months also have fewer transactions than the other months, which suggests a seasonal trend in housing transactions.
In [8]:
missing_df = train_df.isnull().sum(axis=0).reset_index()
missing_df.columns = ['column_name', 'missing_count']
missing_df['missing_ratio'] = missing_df['missing_count'] / train_df.shape[0]
deletecolumn = missing_df[missing_df['missing_ratio']>0.25]["column_name"]
#deletecolumn contains the columns whose data is missing in more than 25% of records.
In [9]:
#missing ratio of variables chosen
usetrain_df = train_df[['bathroomcnt', 'bedroomcnt', 'calculatedbathnbr',
'calculatedfinishedsquarefeet', 'finishedsquarefeet12', 'fips',
'fullbathcnt', 'latitude', 'longitude', 'lotsizesquarefeet',
'propertylandusetypeid', 'rawcensustractandblock', 'regionidcity',
'regionidcounty', 'regionidzip', 'roomcnt', 'yearbuilt',
'structuretaxvaluedollarcnt', 'taxvaluedollarcnt', 'assessmentyear',
'landtaxvaluedollarcnt', 'taxamount', 'censustractandblock']]
missingValueColumns = usetrain_df.columns[usetrain_df.isnull().any()].tolist()
msno.bar(usetrain_df[missingValueColumns],\
figsize=(24,8),color='black',fontsize=10)
In [10]:
msno.matrix(usetrain_df[missingValueColumns],width_ratios=(10,1),\
figsize=(20,8),color=(0,0, 0),fontsize=12,sparkline=True,labels=True)
In [11]:
msno.geoplot(usetrain_df, x='longitude', y='latitude')
We decide to drop variables whose missing ratio is more than 25%. Here we show the missing-data pattern for the variables we keep. The first plot shows the completeness of these variables; the black portion indicates where data exist. The second plot visualizes missingness record by record and reveals some structure: in many records several variables are missing together, such as 'finishedsquarefeet12' and 'lotsizesquarefeet', so we cannot simply impute each variable according to its own distribution. We also draw a geographical plot of the missingness: it automatically splits the dataset into statistically significant chunks and colors them by the average nullity of the data points within them. Later we will compare imputation strategies and choose the best way to fill the remaining missing data.
Here we explore the relationships among some important variables. These variables were selected by the feature importance of an initial experimental XGBoost regression that included all remaining variables.
In [12]:
corrMatt = train_df[['finishedfloor1squarefeet','numberofstories','regionidcity','poolcnt','regionidzip',
'buildingclasstypeid','calculatedbathnbr','fireplacecnt','taxdelinquencyflag','pooltypeid10',
'pooltypeid7','structuretaxvaluedollarcnt','propertyzoningdesc','finishedsquarefeet13',
'garagetotalsqft','censustractandblock','finishedsquarefeet6','calculatedfinishedsquarefeet',
'taxdelinquencyyear','latitude']].corr()
mask = np.array(corrMatt)
mask[np.tril_indices_from(mask)] = False
fig,ax= plt.subplots()
fig.set_size_inches(8,5)
sns.heatmap(corrMatt, mask=mask,vmax=.4, square=True)
Out[12]:
From this correlation matrix, several important variables are positively correlated with each other, so we need the Lasso or another penalty function to reduce the effect of this multicollinearity.
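As an illustration of the penalized approach mentioned above (this is not part of the original pipeline), a quick LassoCV fit on the roughly imputed, scaled predictors shows which coefficients the L1 penalty shrinks toward zero; the median imputation here is only a placeholder.
In [ ]:
# Illustrative only: median-impute the selected predictors, scale them, and let
# LassoCV pick the penalty strength; coefficients shrunk to zero flag redundant features.
from sklearn.linear_model import LassoCV
lasso = LassoCV(cv=5)
lasso.fit(scale(usetrain_df.fillna(usetrain_df.median())), train_df['logerror'])
lasso_coef = pd.Series(lasso.coef_, index=usetrain_df.columns).sort_values()
print(lasso_coef)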
First, let's look at the distribution of the response variable, logerror. After clipping the outliers, the histogram shows that logerror has a nice normal-like distribution, which fits the assumption of a linear regression model. The dataset only covers one year of transactions, so there may be a seasonal effect in logerror, but given the short period we will not use this information to train our model.
In [195]:
upper = np.percentile(train_df.logerror.values, 97.5)
lower = np.percentile(train_df.logerror.values, 2.5)
train_df.loc[train_df['logerror'] > upper, 'logerror'] = upper
train_df.loc[train_df['logerror'] < lower, 'logerror'] = lower
fig,ax = plt.subplots()
fig.set_size_inches(8,3)
sns.distplot(train_df.logerror.values, bins=50,kde=False,color="blue",ax=ax)
ax.set(xlabel='logerror', ylabel='Frequency',title="Distribution Of Response")
Out[195]:
In [204]:
traingroupedMonth = train_df.groupby(["month"])["logerror"].mean().to_frame().reset_index()
fig,ax1= plt.subplots()
fig.set_size_inches(10,3)
sns.pointplot(x=traingroupedMonth["month"], y=traingroupedMonth["logerror"], data=traingroupedMonth, join=True)
ax1.set(xlabel='Month Of Year 2016', ylabel='logerror',title="Average logerror",label='big')
Out[204]:
Based on the exploratory analysis, we already know that our response variable is roughly normally distributed, and we have selected predictors whose missing rate is below 25%. Now we deal with the X variables: factorize the categorical variables, figure out their importance, and then select some of them as our final predictors.
After checking the meaning of these variables, the following are categorical: fips, propertylandusetypeid and regionidcounty. regionidcounty takes only three values (3101, 1286 and 2061), so we simply factorize it as 0, 1 and 2. The values of propertylandusetypeid are codes without numeric meaning, and there are many unique values, so we group them into larger classes based on their meanings and frequencies and then one-hot encode them into dummy variables. Since fips is a Federal Information Processing Standard code, we use it to recover geographic information about the properties. A sketch of this encoding step is shown below.
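A minimal sketch of the encoding described above, assuming the merged train_df from earlier; the direct mapping and the use of pd.get_dummies are illustrative and do not reproduce the exact grouping of land-use codes we used.
In [ ]:
# Illustrative encoding sketch: factorize regionidcounty and one-hot encode
# fips and propertylandusetypeid (the real pipeline also groups rare land-use codes).
train_df['regionidcounty'] = train_df['regionidcounty'].map({3101: 0, 1286: 1, 2061: 2})
dummies = pd.get_dummies(train_df[['fips', 'propertylandusetypeid']].astype('category'),
                         prefix=['fips', 'landuse'])
train_df = pd.concat([train_df.drop(['fips', 'propertylandusetypeid'], axis=1), dummies], axis=1)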
To figure out the relationship between logerror and the other variables, we fit an XGBoost model and inspect the feature importances. We first tried a correlation matrix to check whether some variables were essentially independent of logerror, but the correlations were all so similar that we could not tell which ones mattered more, so we dig deeper with XGBoost. The code is very similar to the XGBoost model we build later, so we omit it here and only explain the result (a rough sketch follows). The dataset used here includes all variables, in case the rough selection would drop an extremely important one, and all missing values are filled with -999. The ten most important variables are 'finishedfloor1squarefeet', 'numberofstories', 'regionidcity', 'poolcnt', 'regionidzip', 'buildingclasstypeid', 'calculatedbathnbr', 'fireplacecnt', 'taxdelinquencyflag' and 'pooltypeid10'. These variables cover different aspects of a house, such as location, amenities, size and interior design, and we take these aspects into account when choosing our final features.
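A rough sketch of the omitted importance probe, assuming the merged train_df; the label-encoding of the text columns and the default hyperparameters here are placeholders rather than the exact settings we used.
In [ ]:
# Fill every missing value with -999, label-encode text columns so XGBoost can
# consume them, fit a default regressor on all predictors, and rank by f-score.
probe = train_df.drop(['parcelid', 'transactiondate', 'logerror'], axis=1).copy()
for col in probe.columns:
    if probe[col].dtype == 'object':
        probe[col], _ = pd.factorize(probe[col])
probe = probe.fillna(-999)
probe_model = XGBRegressor(learning_rate=0.1, n_estimators=100, max_depth=5, seed=27)
probe_model.fit(probe, train_df['logerror'])
print(pd.Series(probe_model.booster().get_fscore()).sort_values(ascending=False).head(10))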
For this part, we use three regression methods: ridge, random forest and support vector machine. We train each model and then combine them with stacking. We fill the missing data with the mode, since that gives the best prediction. For each method, we use grid-search cross validation to find the best parameters for the final model. The best parameters are shown below; when we upload the predicted values, the returned logerror score is 0.0870198.
Ridge | RandomForest | SVM | |||
---|---|---|---|---|---|
parameter | value | parameter | value | parameter | value |
alpha | 1 | max_depth | 14 | bandwidth | 8.07e-05 |
max_leaf_nodes | 400 | ||||
min_samples_leaf | 1 | ||||
min_sample_split | 5 |
In [31]:
X = pd.read_table('adjustedinput/mode/x_train_mode.csv',sep=',')
X = X.iloc[:,1:]
y = pd.read_csv('adjustedinput/train_y.csv')
X_train = scale(X)
y_train = np.array(y).ravel()
In [ ]:
ridge_test = {
'alpha':[1e-15, 1e-10, 1e-8, 1e-5,1e-4, 1e-3,1e-2, 1, 5, 10,30,50,100]}
gsearch = GridSearchCV(estimator = Ridge(),
param_grid = ridge_test,n_jobs=4,iid=False, cv=5,scoring='neg_mean_squared_error')
gsearch.fit(X_train,y_train)
gsearch.grid_scores_, gsearch.best_params_, gsearch.best_score_
1) Tune max_depth and max_leaf_nodes
In [3]:
maxd, maxleaf, scores = [], [], []
for i,j in itertools.product(range(2,10,2),[10,50,100,200,500]):
RFC=RandomForestRegressor(max_depth=i,max_leaf_nodes=j, min_samples_leaf=1,
min_samples_split=2,min_weight_fraction_leaf=0.0,oob_score=True)
score = RFC.fit(X_train,y_train).oob_score_
maxd.append(i); maxleaf.append(j);scores.append(abs(score))
print('max_depth:',i,'max_leaf_nodes:',j,'score:%.4f'%abs(score))
index=np.argmax(scores)
print('best:','max_depth:',maxd[index],'max_leaf_nodes:',maxleaf[index],'score',scores[index])
The results show that for both parameters, the more leaf nodes and the greater the depth, the better the performance, so to find the optimum we try higher values.
In [116]:
maxd, maxleaf, scores = [], [], []
for i,j in itertools.product(range(8,20,3),[150,200,250,300,400]):
RFC=RandomForestRegressor(max_depth=i,max_leaf_nodes=j, min_samples_leaf=1,
min_samples_split=2,min_weight_fraction_leaf=0.0,oob_score=True)
score = RFC.fit(X_train,y_train).oob_score_
maxd.append(i); maxleaf.append(j);scores.append(abs(score))
print('max_depth:',i,'max_leaf_nodes:',j,'score:%.4f'%abs(score))
index=np.argmax(scores)
print('best:','max_depth:',maxd[index],'max_leaf_nodes:',maxleaf[index],'score',scores[index])
The results show that the best max_depth is 14 and the best max_leaf_nodes is 400.
2) Tune min_samples_leaf and min_samples_split
Using the best values found for max_depth and max_leaf_nodes, we continue tuning the other parameters.
In [235]:
minleaf, minsplit, scores = [], [], []
for i,j in itertools.product([1,10,30],[2,5,10,50]):
RFC=RandomForestRegressor(max_depth=14,max_leaf_nodes=400, min_samples_leaf=i,
min_samples_split=j,min_weight_fraction_leaf=0.0,oob_score=True)
score = RFC.fit(X_train,y_train).oob_score_
minleaf.append(i); minsplit.append(j);scores.append(abs(score))
print('min_samples_leaf:',i,'min_samples_spl:',j,'score',"%.6f"%abs(score))
index=np.argmax(scores)
print('best:','min_samples_leaf:',minleaf[index],'min_samples_spl:',minsplit[index],'score',scores[index])
The best parameters for the random forest are therefore max_depth=14, max_leaf_nodes=400, min_samples_leaf=1 and min_samples_split=5, as listed in the table above. A short sketch of the tuned model follows.
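A brief sketch (not in the original notebook) of refitting the random forest with these tuned parameters and checking its out-of-bag score:
In [ ]:
# Refit the random forest with the tuned parameters; oob_score_ gives an
# out-of-bag R^2 estimate on the training data.
rf_best = RandomForestRegressor(max_depth=14, max_leaf_nodes=400,
                                min_samples_leaf=1, min_samples_split=5,
                                oob_score=True)
rf_best.fit(X_train, y_train)
print('OOB R^2: %.4f' % rf_best.oob_score_)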
In [9]:
import hpsklearn
from hpsklearn import HyperoptEstimator, svc
from hyperopt import tpe
from sklearn.svm import SVR
In [10]:
X_train = X_train.astype('float64')
y_train = y_train.astype('float64')
In [11]:
mysvr = SVR()
estim = HyperoptEstimator( regressor =hpsklearn.components.svr('mysvr'),
algo=tpe.suggest, trial_timeout=100)
estim.fit(X_train,y_train)
In [12]:
estim.best_model()
Out[12]:
In [13]:
from mlxtend.regressor import StackingRegressor
In [15]:
ridge = Ridge(alpha=1)
svr = SVR(C=8.07639000981e-05, cache_size=512, coef0=0.0, degree=1,
epsilon=561.857419396, gamma='auto', kernel='linear',
max_iter=76417421.0, shrinking=False, tol=0.00924829940784,
verbose=False)
rf = RandomForestRegressor(max_depth=14,max_leaf_nodes=400, min_samples_leaf=1,
min_samples_split=5)
stregr = StackingRegressor(regressors=[ridge,rf],meta_regressor=svr)
stregr.fit(X_train, y_train)
# X_test and y_test come from a hold-out split of the training data (split not shown in this notebook)
ypred = stregr.predict(X_test)
print("Mean Squared Error: %.4f" % np.mean((ypred - y_test) ** 2))
In [16]:
test = pd.read_table('originalinput/sample_submission.csv',sep=',')
test_df_remaining = pd.read_table("adjustedinput/mode/test_df_remaining_mode.csv",sep=',')
test_df_remaining = test_df_remaining.iloc[:,1:]
In [15]:
test_df_remaining.head()
Out[15]:
In [ ]:
#estimate response variable in test data
y_predict_stack = stregr.predict(test_df_remaining)
for i in range(1,7):
test.iloc[:,i]=y_predict_stack
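The notebook does not show writing out the submission; a one-line sketch, with a hypothetical file name:
In [ ]:
# Write the filled sample-submission frame for upload to Kaggle
# ('stacking_submission.csv' is a placeholder name).
test.to_csv('stacking_submission.csv', index=False, float_format='%.6f')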
For this part, we first fill the missing data in three different ways: with the mean, the median and the mode. Then, for each dataset, we use grid-search cross validation to find the best parameters one by one; only the training process for the median-filled model is shown here. The best parameters for each model and the prediction scores are listed below, and the results show that filling missing values with the mode gives the best answer. A sketch of the imputation step itself follows the table.
| | Mean | Median | Mode |
|---|---|---|---|
| learning_rate | 0.1 | 0.1 | 0.1 |
| n_estimators | 140 | 1000 | 140 |
| max_depth | 5 | 5 | 5 |
| min_child_weight | 6.5 | 6 | 8.5 |
| gamma | 0 | 0.1 | 0.01 |
| subsample | 0.85 | 0.85 | 0.8 |
| colsample_bytree | 0.85 | 0.85 | 0.75 |
| reg_alpha | 5 | 5 | 1 |
| reg_lambda | 100 | 50 | 500 |
| scale_pos_weight | 1 | 1 | 1 |
| logerror | 0.0660177 | 0.0673185 | 0.0657067 |
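The imputation itself is not shown in the notebook; a minimal sketch, assuming usetrain_df holds the selected predictors and using placeholder output paths that mirror the adjustedinput/ layout:
In [ ]:
# Fill the selected predictors with the column mean, median, or mode and save
# each variant (paths are placeholders mirroring the adjustedinput/ folders).
fill_values = {
    'mean': usetrain_df.mean(),
    'median': usetrain_df.median(),
    'mode': usetrain_df.mode().iloc[0],  # mode() returns a DataFrame; take the first row
}
for name, values in fill_values.items():
    usetrain_df.fillna(values).to_csv('adjustedinput/%s/x_train_%s.csv' % (name, name))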
In [11]:
def feat_imp_plot(xgbmodel):
    # Fit the model and plot its feature importances by f-score
    xgbmodel.fit(train[predictors], train[target], eval_metric='rmse')
    feat_imp = pd.Series(xgbmodel.booster().get_fscore()).sort_values(ascending=False)
    plt.figure(figsize=(15,5))
    feat_imp.plot(kind='bar', title='Feature Importances')
    plt.ylabel('Feature Importance Score')
In [3]:
train = pd.read_table("x_train_mode.csv",sep=',')
y = pd.read_table("train_y.csv",sep=',')
train['logerror']=y.iloc[:,0]
target = 'logerror'
train = train.iloc[:,1:]
predictors = [x for x in train.columns if x not in [target]]
In [4]:
train.head()
Out[4]:
In [23]:
param_test1 = {
'max_depth':list(range(3,10,2)),
'min_child_weight':list(range(1,6,2))
}
gsearch1 = GridSearchCV(estimator = XGBRegressor(learning_rate =0.1, n_estimators=140, max_depth=5,
min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8, nthread=4, scale_pos_weight=1, seed=27),
param_grid = param_test1, scoring='neg_mean_squared_error',n_jobs=4,iid=False, cv=5)
gsearch1.fit(train[predictors],train[target])
gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_
Out[23]:
Here we have run 12 combinations with wide intervals between values. The ideal values are 3 for max_depth and 3 for min_child_weight. Let's go one step deeper and search a finer grid around these values, since we see a tendency for lower values to perform best.
In [25]:
param_test2 = {
'max_depth':[3,4,5,6,7],
'min_child_weight':[3,4,5,6,7]
}
gsearch2 = GridSearchCV(estimator = XGBRegressor(learning_rate =0.1, n_estimators=140, max_depth=5,
min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8, nthread=4, scale_pos_weight=1, seed=27),
param_grid = param_test2, scoring='neg_mean_squared_error',n_jobs=4,iid=False, cv=5)
gsearch2.fit(train[predictors],train[target])
gsearch2.grid_scores_, gsearch2.best_params_, gsearch2.best_score_
Out[25]:
In [28]:
param_test3 = {
'gamma':[i/10.0 for i in range(0,5)]
}
gsearch3 = GridSearchCV(estimator = XGBRegressor( learning_rate =0.1, n_estimators=140, max_depth=5,
min_child_weight=6, gamma=0, subsample=0.8, colsample_bytree=0.8,nthread=4, scale_pos_weight=1,seed=27),
param_grid = param_test3, scoring='neg_mean_squared_error',n_jobs=4,iid=False, cv=5)
gsearch3.fit(train[predictors],train[target])
gsearch3.grid_scores_, gsearch3.best_params_, gsearch3.best_score_
Out[28]:
This shows that our original value of gamma, i.e. 0.1, is the optimum one. Before proceeding, it is a good idea to re-calibrate the number of boosting rounds for the updated parameters (a sketch with xgb.cv follows), and then continue with the best parameters found so far.
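A hedged sketch of this re-calibration step with xgb.cv (not part of the original notebook); the parameter values reflect the search so far, and 'reg:linear' is the squared-error objective in this xgboost version.
In [ ]:
# Cross-validated early stopping to pick the number of boosting rounds
# for the current parameter settings.
dtrain = xgb.DMatrix(train[predictors], label=train[target])
cv_params = {'eta': 0.1, 'max_depth': 5, 'min_child_weight': 6, 'gamma': 0.1,
             'subsample': 0.8, 'colsample_bytree': 0.8, 'objective': 'reg:linear'}
cv_result = xgb.cv(cv_params, dtrain, num_boost_round=1000, nfold=5,
                   metrics='rmse', early_stopping_rounds=50, seed=27)
print('suggested n_estimators:', cv_result.shape[0])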
In [5]:
param_test4 = {
'subsample':[i/10.0 for i in range(6,10)],
'colsample_bytree':[i/10.0 for i in range(6,10)]
}
gsearch4 = GridSearchCV(estimator = XGBRegressor( learning_rate =0.1, n_estimators=140, max_depth=5,
min_child_weight=6, gamma=0.1, subsample=0.8, colsample_bytree=0.8, nthread=4, scale_pos_weight=1,seed=27),
param_grid = param_test4, scoring='neg_mean_squared_error',n_jobs=4,iid=False, cv=5)
gsearch4.fit(train[predictors],train[target])
gsearch4.grid_scores_, gsearch4.best_params_, gsearch4.best_score_
Out[5]:
Here we find 0.8 as the optimum value for both subsample and colsample_bytree. Now we try values at 0.05 intervals around it.
In [6]:
param_test5 = {
'subsample':[i/100.0 for i in range(70,95,5)],
'colsample_bytree':[i/100.0 for i in range(70,95,5)]
}
gsearch5 = GridSearchCV(estimator = XGBRegressor( learning_rate =0.1, n_estimators=140, max_depth=5,
min_child_weight=6, gamma=0.1, subsample=0.8, colsample_bytree=0.8, nthread=4, scale_pos_weight=1,seed=27),
param_grid = param_test5, scoring='neg_mean_squared_error',n_jobs=4,iid=False, cv=5)
gsearch5.fit(train[predictors],train[target])
gsearch5.grid_scores_, gsearch5.best_params_, gsearch5.best_score_
Out[6]:
In [8]:
param_test6 = {
'reg_alpha':[1e-5, 1e-2, 0.1, 1, 100],
'reg_lambda':[1e-5, 1e-2, 0.1, 1, 100]
}
gsearch6 = GridSearchCV(estimator = XGBRegressor( learning_rate =0.1, n_estimators=140, max_depth=5,
min_child_weight=6, gamma=0.1, subsample=0.85, colsample_bytree=0.85, nthread=4, scale_pos_weight=1,seed=27),
param_grid = param_test6, scoring='neg_mean_squared_error',n_jobs=4,iid=False, cv=5)
gsearch6.fit(train[predictors],train[target])
gsearch6.grid_scores_, gsearch6.best_params_, gsearch6.best_score_
Out[8]:
The values tried so far are very spread out; we should try values closer to the optimum found here (0.1) to see if we get something better.
In [10]:
param_test8 = {
'reg_alpha':[0.2,0.5,0.7,1,2,5,10],
'reg_lambda':[50,100,300,500]
}
gsearch8 = GridSearchCV(estimator = XGBRegressor( learning_rate =0.1, n_estimators=140, max_depth=5,
min_child_weight=6, gamma=0.1, subsample=0.85, colsample_bytree=0.85, nthread=4, scale_pos_weight=1,seed=27),
param_grid = param_test8, scoring='neg_mean_squared_error',n_jobs=4,iid=False, cv=5)
gsearch8.fit(train[predictors],train[target])
gsearch8.grid_scores_, gsearch8.best_params_, gsearch8.best_score_
Out[10]:
You can see that the best values are reg_alpha = 5 and reg_lambda = 50. Now we can apply this regularization in the model and look at the impact:
In [ ]:
param_test7 = {
'reg_alpha':[0, 0.001, 0.005, 0.01, 0.05],
'reg_lambda':[0, 0.001, 0.005, 0.01, 0.05]
}
gsearch7 = GridSearchCV(estimator = XGBRegressor( learning_rate =0.1, n_estimators=140, max_depth=4,
min_child_weight=6, gamma=0.1, subsample=0.8, colsample_bytree=0.8, nthread=4, scale_pos_weight=1,seed=27),
param_grid = param_test7, scoring='neg_mean_squared_error',n_jobs=4,iid=False, cv=5)
gsearch7.fit(train[predictors],train[target])
gsearch7.grid_scores_, gsearch7.best_params_, gsearch7.best_score_
Adding all the tuned parameters to the model below and showing the feature importances:
In [ ]:
xgb1 = XGBRegressor(
learning_rate =0.1,
n_estimators=1000,
max_depth=5,
min_child_weight=6,
gamma=0.1,
subsample=0.85,
colsample_bytree=0.85,
nthread=4,
reg_alpha=5,
reg_lambda=50,
scale_pos_weight=1,
seed=27)
xgb1.fit(train[predictors],train[target])
In [17]:
feat_imp = pd.Series(xgb1.booster().get_fscore()).sort_values(ascending=False)
plt.figure(figsize=(15,5))
feat_imp.plot(kind='bar', title='Feature Importances')
plt.ylabel('Feature Importance Score')
Out[17]:
From the figure above, we can see that taxamount, structuretaxvaluedollarcnt and the location of the house are very important for predicting the logerror. The region county variable has much lower importance, probably because the information it carries is already captured by the other location variables in the model.
In [6]:
from sklearn.externals import joblib
median_xgb = joblib.load("xgb_median.m")
xgb.plot_tree(median_xgb, num_trees=0, rankdir='LR')
Out[6]:
This is the most interesting part: we use TensorFlow to build a neural network with four hidden layers of 20, 15, 10 and 5 neurons (the feature count is 24). We tried different parameters and numbers of training iterations, and the results show that more training iterations do not improve the predictions. The best submission has a logerror of 0.067474, which does not outperform XGBoost. We believe the reason is that a neural network has no obvious advantage on a relatively simple problem like this, although the result is still reasonable.
In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
from tensorflow.contrib import learn
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
import numpy as np
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=42)
total_len = X_train.shape[0]
# Parameters
learning_rate = 0.001
training_epochs = 500
batch_size = 10
display_step = 1
dropout_rate = 0.9
# Network Parameters
n_hidden_1 = 20 # 1st layer number of features
n_hidden_2 = 15 # 2nd layer number of features
n_hidden_3 = 10
n_hidden_4 = 5
n_input = X_train.shape[1]
n_classes = 1
# tf Graph input
x = tf.placeholder("float", [None, 24])
y = tf.placeholder("float", [None])
# Create model
def multilayer_perceptron(x, weights, biases):
# Hidden layer with RELU activation
layer_1 = tf.add(tf.matmul(x, weights['h1']), biases['b1'])
layer_1 = tf.nn.relu(layer_1)
# Hidden layer with RELU activation
layer_2 = tf.add(tf.matmul(layer_1, weights['h2']), biases['b2'])
layer_2 = tf.nn.relu(layer_2)
# Hidden layer with RELU activation
layer_3 = tf.add(tf.matmul(layer_2, weights['h3']), biases['b3'])
layer_3 = tf.nn.relu(layer_3)
# Hidden layer with RELU activation
layer_4 = tf.add(tf.matmul(layer_3, weights['h4']), biases['b4'])
layer_4 = tf.nn.relu(layer_4)
# Output layer with linear activation
out_layer = tf.matmul(layer_4, weights['out']) + biases['out']
return out_layer
# Store layers weight & bias
weights = {
'h1': tf.Variable(tf.random_normal([n_input, n_hidden_1], 0, 0.1)),
'h2': tf.Variable(tf.random_normal([n_hidden_1, n_hidden_2], 0, 0.1)),
'h3': tf.Variable(tf.random_normal([n_hidden_2, n_hidden_3], 0, 0.1)),
'h4': tf.Variable(tf.random_normal([n_hidden_3, n_hidden_4], 0, 0.1)),
'out': tf.Variable(tf.random_normal([n_hidden_4, n_classes], 0, 0.1))
}
biases = {
'b1': tf.Variable(tf.random_normal([n_hidden_1], 0, 0.1)),
'b2': tf.Variable(tf.random_normal([n_hidden_2], 0, 0.1)),
'b3': tf.Variable(tf.random_normal([n_hidden_3], 0, 0.1)),
'b4': tf.Variable(tf.random_normal([n_hidden_4], 0, 0.1)),
'out': tf.Variable(tf.random_normal([n_classes], 0, 0.1))
}
# Construct model
pred = multilayer_perceptron(x, weights, biases)
pred = tf.transpose(pred)
# Define loss and optimizer
cost = tf.reduce_mean(tf.square(pred-y))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)
# Launch the graph
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
# Training cycle
for epoch in range(training_epochs):
avg_cost = 0.
total_batch = int(total_len/batch_size)
# Loop over all batches
for i in range(total_batch-1):
batch_x = X_train[i*batch_size:(i+1)*batch_size]
batch_y = Y_train[i*batch_size:(i+1)*batch_size]
# Run optimization op (backprop) and cost op (to get loss value)
_, c, p = sess.run([optimizer, cost, pred], feed_dict={x: batch_x,
y: batch_y})
# Compute average loss
avg_cost += c / total_batch
# sample prediction
label_value = batch_y
estimate = p
err = label_value-estimate
print ("Optimization Finished!")
predicted_vals = sess.run(pred, feed_dict={x: X_test})
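As a hedged follow-up (not in the original notebook), we can compare the hold-out predictions with the true logerror to get a validation MSE for the network:
In [ ]:
# predicted_vals comes from the session above; Y_test is the hold-out target from the split.
val_mse = np.mean((np.asarray(predicted_vals).ravel() - np.asarray(Y_test).ravel()) ** 2)
print("Validation MSE: %.6f" % val_mse)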
In this project, we have tried three methods. The one with the best result is XGBoost with missing values filled by the mode. The dataset comes from Kaggle and we do not have the true responses for the test set, so we can only judge model accuracy by the score Kaggle returns after we upload our predictions.
Finally, we choose 21 features and XGBoost with mode-filled missing data. Its tuned parameters are shown in part 3.2, and the bar plot of feature importances as well as one XGBoost tree are shown below.
In [9]:
from sklearn.externals import joblib
mode_xgb = joblib.load("xgb_final.m")
xgb.plot_tree(mode_xgb, num_trees=0, rankdir='LR')
Out[9]:
In [34]:
feat_imp = pd.Series(mode_xgb.booster().get_fscore()).sort_values(ascending=False)
plt.figure(figsize=(15,5))
feat_imp.plot(kind='bar', title='Feature Importances')
plt.ylabel('Feature Importance Score of final model')
Out[34]:
From the feature importance plot we can see that taxamount and structuretaxvaluedollarcnt play the most important roles, which makes sense: more expensive houses tend to have higher taxes, so the two are closely related. In addition, longitude, latitude and zip code also have high importance; a likely reason is that housing prices in the same area tend to be similar, so prediction accuracy is also similar for nearby locations.
In [ ]: